Scalable Multi Agent Diffusion Policies for Coverage Control

Vatnsdal, Frederic, Camargo, Romina Garcia, Agarwal, Saurav, Ribeiro, Alejandro

arXiv.org Artificial Intelligence

Abstract--We propose MADP, a novel diffusion-model-based approach for collaboration in decentralized robot swarms. MADP leverages diffusion models to generate samples from complex, high-dimensional action distributions that capture the interdependencies between agents' actions. Each robot conditions policy sampling on a fused representation of its own observations and perceptual embeddings received from peers. To evaluate this approach, we task a team of holonomic robots piloted by MADP with coverage control--a canonical multi-agent navigation problem. The policy is trained via imitation learning from a clairvoyant expert on the coverage control problem, with the diffusion process parameterized by a spatial transformer architecture to enable decentralized inference. We evaluate the system under varying numbers, locations, and variances of importance density functions, capturing the robustness demands of real-world coverage tasks. Experiments demonstrate that our model inherits valuable properties from diffusion models, generalizing across agent densities and environments, and consistently outperforming state-of-the-art baselines.
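The abstract describes each robot sampling an action by running reverse diffusion conditioned on a fused embedding of its own observation and peer embeddings. The toy sketch below is not the paper's implementation: the fusion rule (a simple average) and the noise predictor (a stand-in linear map replacing the learned spatial transformer) are hypothetical, but the DDPM-style reverse loop is the standard sampling procedure the method builds on.

```python
import math
import random

def fuse(own_obs, peer_embeddings):
    # Hypothetical fusion: average the robot's observation with peer embeddings.
    vecs = [own_obs] + peer_embeddings
    dim = len(own_obs)
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def eps_model(x, t, cond):
    # Stand-in for the learned noise predictor (a spatial transformer in the
    # paper); a toy linear map so the sketch runs end-to-end.
    return [0.5 * xi + 0.1 * ci for xi, ci in zip(x, cond)]

def sample_action(cond, steps=10, seed=0):
    """DDPM-style reverse diffusion over a 2-D velocity action (vx, vy)."""
    rng = random.Random(seed)
    betas = [1e-4 + (0.02 - 1e-4) * t / (steps - 1) for t in range(steps)]
    alphas = [1.0 - b for b in betas]
    abars, p = [], 1.0
    for a in alphas:
        p *= a
        abars.append(p)
    x = [rng.gauss(0, 1) for _ in range(len(cond))]  # start from pure noise
    for t in reversed(range(steps)):
        eps = eps_model(x, t, cond)
        coef = betas[t] / math.sqrt(1.0 - abars[t])
        x = [(xi - coef * e) / math.sqrt(alphas[t]) for xi, e in zip(x, eps)]
        if t > 0:  # no noise injected on the final denoising step
            x = [xi + math.sqrt(betas[t]) * rng.gauss(0, 1) for xi in x]
    return x

cond = fuse([0.2, -0.1], [[0.0, 0.3], [0.4, -0.2]])
action = sample_action(cond)
```

Because the policy is a sampler rather than a deterministic map, repeated calls with different seeds yield different coordinated actions from the same conditioning, which is what lets the model represent multi-modal joint behavior.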


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

This work addresses the question of how to improve the invariance properties of Convolutional Neural Networks. It introduces the so-called spatial transformer, a layer that performs an adaptive warping of incoming feature maps, thus generalizing recent attention mechanisms for images. The resulting model requires no extra supervision and is trained end-to-end using backpropagation, leading to state-of-the-art results on several classification tasks. The paper is clearly written, and its main contribution, the spatial transformer layer, is valuable for its novelty, simplicity and effectiveness. The related work section covers most relevant literature, except perhaps recent works that combine deformable parts models with CNNs (see for example "Deformable Part Models are Convolutional Neural Networks" and "End-to-End Integration of a Convolution Network, Deformable Parts Model and Non-Maximum Suppression", both at CVPR 2015), since they also incorporate an inference over deformation or registration parameters, as in the spatial transformer case.


Separate Motion from Appearance: Customizing Motion via Customizing Text-to-Video Diffusion Models

Liu, Huijie, Wang, Jingyun, Ma, Shuai, Hu, Jie, Wei, Xiaoming, Kang, Guoliang

arXiv.org Artificial Intelligence

Motion customization aims to adapt a diffusion model (DM) to generate videos with the motion specified by a set of video clips sharing the same motion concept. To realize this goal, the adaptation of the DM should be able to model the specified motion concept without compromising the ability to generate diverse appearances. Thus, the key to solving this problem lies in how to separate the motion concept from the appearance during the adaptation of the DM. Typical previous works explore different ways to represent and insert a motion concept into large-scale pretrained text-to-video diffusion models, e.g., learning a motion LoRA, using latent noise residuals, etc. While those methods can encode the motion concept, they also inevitably encode the appearance of the reference videos, resulting in weakened appearance generation capability. In this paper, we follow the typical way of learning a motion LoRA to encode the motion concept, but propose two novel strategies to enhance motion-appearance separation: temporal attention purification (TAP) and appearance highway (AH). Specifically, we assume that in the temporal attention module, the pretrained Value embeddings are sufficient to serve as the basic components needed to produce a new motion. Thus, in TAP, we choose only to reshape the temporal attention with motion LoRAs so that Value embeddings can be reorganized to produce a new motion. Further, in AH, we alter the starting point of each skip connection in the U-Net from the output of each temporal attention module to the output of each spatial attention module. Extensive experiments demonstrate that compared to previous works, our method can generate videos with appearance more aligned with the text descriptions and motion more consistent with the reference videos.


Reviews: Universal Correspondence Network

Neural Information Processing Systems

In this paper the authors propose a universal correspondence network using a CNN and a pair-wise ranking loss. Overall the paper is well written and easy to follow. Below is the detailed review. In Line 121 the authors say they denote s = 0 for positive pairs and s = 1 for negative pairs. That way, positive pairs will be trained with the margin-based ranking loss and negative pairs with the mean squared loss.


AdaCred: Adaptive Causal Decision Transformers with Feature Crediting

Kumawat, Hemant, Mukhopadhyay, Saibal

arXiv.org Artificial Intelligence

Reinforcement learning (RL) can be formulated as a sequence modeling problem, where models predict future actions based on historical state-action-reward sequences. Current approaches typically require long trajectory sequences to model the environment in offline RL settings. However, these models tend to over-rely on memorizing long-term representations, which impairs their ability to attribute importance to trajectories and learned representations based on task-specific relevance. In this work, we introduce AdaCred, a novel approach that represents trajectories as causal graphs built from short-term action-reward-state sequences. Our model adaptively learns a control policy by crediting and pruning low-importance representations, retaining only those most relevant to the downstream task. Our experiments demonstrate that AdaCred-based policies require shorter trajectory sequences and consistently outperform conventional methods in both offline reinforcement learning and imitation learning environments.
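The crediting-and-pruning step described above can be caricatured in a few lines: score each token of a short state-action-reward sequence, keep only the top-credited fraction, and preserve temporal order so the pruned sequence is still a valid trajectory. The scoring numbers below are invented for illustration; the paper learns them.

```python
def credit_and_prune(tokens, scores, keep_ratio=0.5):
    """Keep the highest-credit representations, preserving temporal order."""
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
    return [tokens[i] for i in sorted(top)]

# Hypothetical per-token credit scores for a short state-action-reward sequence.
kept = credit_and_prune(["s0", "a0", "r0", "s1", "a1", "r1"],
                        [0.9, 0.1, 0.2, 0.8, 0.05, 0.7])
```

The downstream sequence model then attends only over `kept`, which is how a shorter context can suffice: low-credit tokens never enter the attention window in the first place.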


STTM: A New Approach Based on Spatial-Temporal Transformer and Memory Network for Real-time Pressure Signal in On-demand Food Delivery

Wang, Jiang, Wei, Haibin, Xu, Xiaowei, Shi, Jiacheng, Nie, Jian, Du, Longzhi, Jiang, Taixu

arXiv.org Artificial Intelligence

On-demand Food Delivery (OFD) services have become very common around the world. For example, on the Ele.me platform, users place more than 15 million food orders every day. Predicting the Real-time Pressure Signal (RPS) is crucial for OFD services, as it is primarily used to measure the current pressure on the logistics system. When RPS rises, the pressure increases, and the platform needs to quickly take measures to prevent the logistics system from being overloaded. Usually, the average delivery time for all orders within a business district is used to represent RPS. Existing research on OFD services primarily focuses on predicting the delivery time of orders, while relatively little attention has been given to the RPS. Previous research directly applies general models such as DeepFM, RNN, and GNN for prediction, but fails to adequately exploit the unique temporal and spatial characteristics of OFD services, and suffers from insufficient sensitivity during sudden severe weather or peak periods. To address these problems, this paper proposes a new method based on a Spatio-Temporal Transformer and Memory Network (STTM). Specifically, we use a novel Spatio-Temporal Transformer structure to learn logistics features across temporal and spatial dimensions and to encode the historical information of a business district and its neighbors, thereby learning both temporal and spatial information. Additionally, a Memory Network is employed to increase sensitivity to abnormal events. Experimental results on a real-world dataset show that STTM significantly outperforms previous methods in both offline experiments and an online A/B test, demonstrating the effectiveness of the method.
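The memory-network component mentioned above is, at its core, a soft lookup: the current district's features form a query, stored slots for past abnormal regimes act as keys, and the blended values nudge the prediction. The slots and stored offsets below are invented for illustration; only the attention-read mechanics are standard.

```python
import math

def memory_read(query, keys, values, temperature=0.1):
    # Soft attention over memory slots: dot-product similarity -> softmax -> blend.
    sims = [sum(q * k for q, k in zip(query, key)) / temperature for key in keys]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

# Hypothetical slots: [severe weather, ordinary peak] with stored RPS offsets.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[12.0], [2.0]]
boost = memory_read([1.0, 0.0], keys, values)  # query resembling severe weather
```

A low temperature sharpens the softmax, so a query that matches the severe-weather slot retrieves almost exactly its stored offset; that sharpness is one plausible way such a module raises sensitivity to rare abnormal events.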


Spatial Transformer Networks

Neural Information Processing Systems

Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. We show that the use of spatial transformers results in models which learn invariance to translation, scale, rotation and more generic warping, resulting in state-of-the-art performance on several benchmarks, and for a number of classes of transformations.
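The module described above has two differentiable pieces: a grid generator that maps output coordinates through a predicted 2x3 affine matrix, and a bilinear sampler that reads the input feature map at those (generally fractional) locations. A minimal pure-Python sketch of both, on a toy 2x2 feature map (the localization network that predicts `theta` is omitted; an identity transform stands in for its output):

```python
import math

def affine_grid(theta, H, W):
    # Target coordinates in [-1, 1], mapped back through a 2x3 affine matrix.
    grid = []
    for i in range(H):
        for j in range(W):
            y = -1.0 + 2.0 * i / (H - 1)
            x = -1.0 + 2.0 * j / (W - 1)
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            grid.append((xs, ys))
    return grid

def bilinear_sample(fmap, grid, H, W):
    # Differentiable sampling: each output is a bilinear blend of up to four
    # neighboring input pixels, with out-of-bounds neighbors contributing zero.
    out = []
    for xs, ys in grid:
        fx = (xs + 1.0) * (W - 1) / 2.0  # back to pixel coordinates
        fy = (ys + 1.0) * (H - 1) / 2.0
        x0, y0 = math.floor(fx), math.floor(fy)
        val = 0.0
        for dy in (0, 1):
            for dx in (0, 1):
                xi, yi = x0 + dx, y0 + dy
                if 0 <= xi < W and 0 <= yi < H:
                    val += (1 - abs(fx - xi)) * (1 - abs(fy - yi)) * fmap[yi][xi]
        out.append(val)
    return [out[r * W:(r + 1) * W] for r in range(H)]

fmap = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
warped = bilinear_sample(fmap, affine_grid(identity, 2, 2), 2, 2)
```

Because the bilinear weights are piecewise-linear in `theta`, gradients flow from the warped output back to the transform parameters, which is what lets the whole module train with ordinary backpropagation and no extra supervision.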